Topic 8 - Multivariate analyses

Ecological phenomena are inherently complex, and it is rare that a single variable is sufficient to describe an ecological system. Therefore, it is common to deal with:

  • multiple response variable

  • multiple explanatory variables

Multivariate analyses may reveal patterns that would not be detectable by combination of univariate methods.

Comparaison between univariate and multivariate analyses: multivariate methods allow the evaluation of all response variables simultaneously. R1,…,i: Response variables; E1,…,i: Explanatory variables; S1,…,i: Samples

Comparaison between univariate and multivariate analyses: multivariate methods allow the evaluation of all response variables simultaneously. R1,…,i: Response variables; E1,…,i: Explanatory variables; S1,…,i: Samples

Applications

Five types of scientific inquiries usually suit to the application of multivariate methods.

  • Data reduction or structural simplification: summarize multiple variables through a comparatively smaller set of ‘synthetic’ variables. Ex: Principal Component Analysis (PCA)

  • Sorting and grouping: many ecological questions are concerned with the similarity or dissimilarity. Ex:Cluster analysis, non Metric Dimensional Scaling (nMDS)

  • Investigation of the dependence among variables: dependence among response variables, among explanatory variables, or among both. Ex: Redundancy analysis and other constrained analysis

  • Prediction: once the dependence detected and characterized multivariate models may be constructed in a very similar as we did before with univariate models.

  • Hypothesis testing: detect and test pattern in the data (be careful of data dredging) . Ex: MANOVA, PERMANOVA, ANOSIM

Correspondance analysis of methods usage in various scientific fields: axis 1 differentiate microscopic from macroscopic organisms. Cluster, PCA, etc are exploratory methods and apply preferentially to small organisms.

Correspondance analysis of methods usage in various scientific fields: axis 1 differentiate microscopic from macroscopic organisms. Cluster, PCA, etc are exploratory methods and apply preferentially to small organisms.

Typical Data Structure

In ecology, ‘typical’ data structure will be:

  • objects in row (e.g. samples can be sites, time periods, etc.)

  • measured variables for those objects in columns (e.g. species, environmental parameters, etc.)

Important note: observations on object are not necessarily independent on those made on another object, and a mixture of dependent and independent objects is possible e.g. site and year

Data transformation

Measured variables can be binary, quantitative, qualitative, rank-ordered, or even a mixture of them.

If variables do not have uniform scale (e.g. environmental parameters measured in different units or scales), they usually have to be transformed before performing further analyses.

  • Standardization: provides dimensionless variables and removes the influence of magnitude differences between scales and units

  • Normalization: aims at correcting the distribution shapes of certain variables.

    • arcsin (x) family of transformations for percentage and proportions

    • log (x + constant) for variables departing moderately from normal distribution

    • sqrt (x + constant)for variables departing slightly from normal distribution

    • Hellinger transformation to make data containing many zeros suitable for PCA or RDA, the ‘double zeros’ problem.

    • Chord transformation to give less weight to rare species (especially when rare species are not truly rare)

The function decostand from the vegan package offers an easy way to transform your data. The varespec data frame has 24 rows and 44 columns. Columns are estimated cover values of 44 lichen species. The variable names are formed from the scientific names, and are self explanatory for anybody familiar with vegetation type / lichen species.

##    Callvulg Empenigr Rhodtome Vaccmyrt Vaccviti
## 18     0.55    11.13     0.00     0.00    17.80
## 15     0.67     0.17     0.00     0.35    12.13
## 24     0.10     1.55     0.00     0.00    13.47
## 27     0.00    15.13     2.42     5.92    15.97
## 23     0.00    12.68     0.00     0.00    23.73
##    Callvulg Empenigr Rhodtome Vaccmyrt Vaccviti
## 18        1        1        0        0        1
## 15        1        1        0        1        1
## 24        1        1        0        0        1
## 27        0        1        1        1        1
## 23        0        1        0        0        1

Ressemblance

  • Most methods of multivariate analysis are explicitly or implicitly based on the comparison of all possible pairs of objects or descriptors.

  • Comparison takes the form of association measures which are assembled in a square and symmetrical association matrix of dimension \(n\) x \(n\) when objects are compared, or \(p\) x \(p\) when variables are compared. The choice of a suitable association coefficient is crucial for further analysis.

  • When pairs of objects are compared, the analysis is said to be in Q mode.When pairs of descriptors are compared, the analysis is said to be in R mode.

Q mode

Virtually, all distance or similarity measures used in ecology are symmetric: the coefficient between pairs of objects is the same!

But whatto deal with double-zeros?

  • the zero value has the same meaning as any other values (e.g. 0mg/L of O2 in deep anoxic layer of a lake)

  • the zero value in matrix of species abundances (or presence-absence) can not always be counted as an indication of resemblance (presence has an ecological meaning, but no conclusions on the absence: e.g. is the absence of a given nationality in this class means that no students from this specific country are in NTU? And is it an element to evaluate the similarity with other university (high similarity because many nationalities probably absent ? No, but at same sample size 1/0 becomes informative)

Because of the double-zeros problem, 2 classes of association measures exist based on how they deal with this issue".

  • The symmetrical coefficients will consider the information from the double-zero (also called ‘negative matched’).

  • The asymmetrical coefficients will ignore the information send from the double-zero.

When analyzing species data, it is often recommended to use asymmetrical coefficients unless you have reason to consider each double absence in the matrix (e.g. controlled experiment with known community composition or ecologically homogenous areas with disturbed zones)

Presence/absence-based dissimilarity metrics

We use Jaccard Similarity (S7) to find similarities between sets. So first, let’s learn the very basics of sets.

Now going back to Jaccard similarity.The Jaccard similarity measures similarity between finite sample sets, and is defined as the cardinality of the intersection of sets divided by the cardinality of the union of the sample sets. Suppose you want to find jaccard similarity between two sets A and B it is the ration of cardinality of A ∩ B and A ∪ B.

Other distances applying to presence-Absence data: Sørensen (S8), Ochiai (S14)

Abundance-based dissimilarity metrics

When your community data samples include abundance information (as opposed to simple presence-absence) you have a wider choice of metrics to use in calculating (dis)similarity.

When you have abundance data your measures of (dis)similarity are a bit more “refined” and you have the potential to pick up patterns in the data that you would otherwise not see using presence-absence data.

There are many metrics that you might use to explore (dis)similarity. Four of them are particularly common:

  • Bray-Curtis
  • Canberra
  • Manhattan
  • Euclidean

You can get the spreadsheet here to examine how to compute them in details

Euclidean distance (D2) is the most commonly-used of our distance measures. For this reason, Euclidean distance is often just to referred to as “distance”. When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them.This distance between two points is given by the Pythagorean theorem.

Here the abundance of a species from one sample is subtracted from its counterpart in the other sample. Instead of ignoring the sign, the result is squared (which gives a positive value):

\(E_d=\sqrt{\sum (x_i-y_j)^2}\)

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. In simple way of saying it is the absolute sum of difference between the x-coordinates and y-coordinates. Suppose we have a Point A and a Point B: if we want to find the Manhattan distance between them, we just have to sum up the absolute x-axis and y–axis variation. We find the Manhattan distance between two points by measuring along axes at right angles.

This is the simplest dissimilarity metric to compute:

\(CB_d = \sum|x_i-x_j|\)

Bray-Curtis (D14) dissimilarity is the golden ditance metric in ecology.At first, you subtract the abundance of one species in a sample from its counterpart in the other sample but ignore the sign. The second component is the abundance of a species in one sample added to the abundance of its counterpart in the second sample. If a species is absent, then its abundance should be recorded as 0 (zero).

\(BC_d = \frac {\sum |x_i-x_j|}{\sum(x_i+x_j)}\)

The Canberra dissimilarity uses the same components as Bray-Curtis but the components are summed differently:

\(C_d = \sum \frac { |x_i-x_j|}{(x_i+x_j)}\)

Many other ‘distances’ exist, each with their code

Those distance can be computed from am un-transformed or transformed matrix.

Computation

The functions vegdist from the veganpackage and dist from the stats package compute dissimilarity indices useful and popular among community ecologists.

## [1] 0.5310021 0.6680661 0.5621247 0.3747078 0.5094738 0.6234419
## [1] 0.3619358 0.4969291 0.3853004 0.3145055 0.3152830 0.3932524
## [1] 0.5518126 0.6834930 0.5978450 0.3793678 0.4968567 0.6448499
## [1] 0.3613332 0.5038012 0.4264724 0.3032106 0.3319945 0.4187854
## [1] 0.3333333 0.4705882 0.5000000 0.3333333 0.4166667 0.4000000
## [1] 0.3333333 0.4705882 0.5000000 0.3333333 0.4166667 0.4000000
## [1] 0.2000000 0.3076923 0.3333333 0.2000000 0.2631579 0.2500000
## [1] 0.4458781 0.5504882 0.5757576 0.4458781 0.5128786 0.4995210

Matrix visualization

Using the package gclus and the function coldiss (Borcard et al. 2011) dissimilarities can be easily visualized in a heat map

In the untransformed distance matrix, the small differences in abundant species have the same importance as small differences in species with few individuals

Distance matrices are also commonly visualized through a network:

Note 1: In Q mode similarity from binary data can be interpret by a simple matching coefficient S1: for each pair of sites, it is the ratio between the number of double 1s plus double 0s and the total number of variables.

Note 2: For mixed types variables, including categorical or qualitative multiclass variables use Gower’s similarity (S15). It is easily computed in R using daisy function built in the cluster package. Avoid vegdist with method='gower', which is appropriate for quantitative and presence-absence, but not for multiclass variables. Overall, gowdis from the package FD is the most complete function to compute Gower’s coefficient in R, and commonly used in trait-based approach analyses.

R mode

Correlation type coefficents are commonly used to compare variable in R mode. Remember:

  • Parametric (Pearson coefficient)

  • Non-parametric (Spearman, Kendall for quantitative or semi-quantitative data)

  • Chi-square statistic + its derived forms for qualitative variables

  • Binary coefficient such as Jaccard, Sorensen, and Ochiai for presence-absence data

In R mode, the use of Pearson coefficient is very common. Applied on binary variables, r Pearson is called the point correlation coefficient. Using the function panelutils (Borcard et al. (2011):

RP16: Load tikus data set from the package mvabund. Extract coral species and environmental (years) data, and select samples from year 1981, 1983, and 1985 only.

  • From the species data, calculate two distance matrices: (1) one using Bray-Curtis and, (2) one using Euclidean distance on log-transformed data

  • Plot heat maps of respective matrix and compare. Briefly explain the differences.

Cluster analysis

We often aim to recognize discontinuous subsets in an environment which is represented by discrete (taxonomy) changes but perceived as continuous changes in ecology.

Clustering consists in partitioning the collection of objects (or descriptors in R-mode). Clustering does not test any hypothesis.

Clustering is an explanatory procedure which helps to understand data with complex structure and multivariate relationships, and is a very useful method to extract knowledge and information especially from large datasets.

Many clustering approaches rely on association matrix, which stresses on the importance of the chhoice of a appropriate association coefficient.

Families

  1. Sequential or Simultaneous algorithms (most of the clustering algorithm)

  2. Agglomerative or Divisive

  3. Monothetic (common prop.) versus Polythetic
  4. Hierarchical versus Non-hierarchical (flat)

  5. Probabilistic (decision tree) versus non-probabilistic methods

  6. Hard and Soft (may overlap)

Calculate numerical classification requires two arguments: matrix of distances among samples (ecological resemblance, D) and the method: name of clustering algorithm.

Common hierarchical clustering methods are available through the function hclust from the stats package. Be careful of:

  • Data dredging clustering produces tree-like structure. You don’t choose clustering method according to how your tree looks like.

  • The suitable method is usually carefully selected and/or evaluated accoding to the data set you are dealing with and your hypotheses.

Hierarchical Clustering

  • Single linkage agglomerative clustering (neighbor sorting)

    • commonly produced chain dendrograms: a pair is linked to a third object

    • agglomerates objects on the basis of their shortest pairwise distance

    • partitions difficult to interpret, but gradients quite clear

e.g. Single linkage agglomerative clustering computed on chord distance matrix

Single linkage allows object to agglomerate easily to a group since a link to a single object of the group suffices to induce fusion. This is the ‘closest’ friend’ procedure.

  • Complete linkage agglomerative clustering (furthest neighbor sorting)

    • dendrograms look a bit like rakes – some cluster merge together at the highest dissimilarity

    • allow an object (or a group) to agglomerate with another group only at the distance corresponding to that of the most distant pair of objects

e.g. Complete linkage agglomerative clustering computed on chord distance matrix

A group admits a new member only at the distance corresponding to the furthest object of the group: one could say that the admission requires unanimity of the group.

  • Average agglomerative clustering

    • it is the method the most common in ecology (species data)

    • this family comprises four methods that are based on average dissimilarities among objects or on the centroids of cluster,

e.g. Average agglomerative clustering (UPGMA) computed on chord distance matrix

  • Ward’s Minimum Variance clustering

  • Ward method is also a favorite clustering method

  • it is based on linear model criterion of least square: within-group sum of square (i.e. the square error of ANOVA) is minimized.

Note: Ward method is based on Euclidean model. It should not be combined with distance measures, which are not strictly metric such as the popular Bray-Curtis distance.

  • Flexible clustering

hclust is called in this algorithm, but flexible clustering is available in the R package cluster with the function agnes

  • Beta flexible methods

  • Model encompassing all the clustering methods seen before, which are obtained by changing the values of four parameters

Clustering Quality

There are many ways to evaluate the overall quality of the chose clustering algorithm and therefore their representations.

  • the cophenetic correlation related distances extracted from the dendrogram (function cophenetic on a hclust object) with ditances in our original distance matrices. A higher correlation means a better representation of the initial matrix.
## [1] 0.6684868
## [1] 0.8212132
## [1] 0.8586625
## [1] 0.716064
  • the shepard-like diagram is a plot that represents orginal distances against the cophentic distances. Can be combined with cophentic correlation seen above.

## null device 
##           1
  • the Fusion Level Values. Examine values where a fusion between two branches of a dendrogram occurs. This graph is useful whenver you want to define an interpretable cutting levels.

The graph of fusion level values shows clear jump after each fusion between 2 and 6 groups Let’s cut our dendrogram at the corresponding distance Do the groups makes sense? Do you obtain groups containing a substantial number of sites?

Let’s repeat the same for all the clustering methods:

All graphs look a but different: there is no single ‘truth’ among these solution and each may provide insight onto the data. However the Chord UPGMA received the best supported in both the cophenetic correlation and sheppard-like diagram. Therefore, fusion levels may have have a stronger support examined with this cluster approach.

One can compared classification by cutting tree cutree and comparing grouping using contingency tables:

##                spe.ch.complete.g
## spe.ch.single.g 1 2 3 4 5
##               1 5 7 0 0 0
##               2 0 1 0 7 0
##               3 0 0 2 0 0
##               4 0 0 1 0 0
##               5 0 0 0 0 1

If two classifications provided the same group contents, the contingency table would show only non zero frequency value in each row and each column0 It is never the case here.

  • the Silhouette widths indicator.

    • The silhouette width is a measure of the degree of membership of an object to its cluster based on the average distance between this object and all objects of the cluster to which is belongs, compared to the same measure for the next closest cluster.

    • Silhouette widths range from 1 to 1 and can be averaged over all objects of a partition

    • The greater the value is, the greater the better the object is clustered Negative values mean that the corresponding objects have probably placed in the wrong cluster (intra group variation inter group variation).

  • the Mantel test

    • Compares the original distance matrix to binary matrices computed from dendrogram cut at various level

    • Chooses the level where the matrix ( correlation between the two is the highest

    • The Mantel correlation is in its simplest sense, i.e. the equivalent of a Pearson r correlation between the values in the distance matrices

Comparison between the distance matrix and binary matrices representing partitions

  • the Elbow method

    • This method looks at the percentage of variance explained (SS) as a function of the number of cluster

    • One should choose a number of clusters so that adding another cluster doesn’t give much better explanation

At some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion” (wss) .

  • And many others indicators

Clustering options

  • Visualize groupings

  • Spatial clustering (example)

## null device 
##           1
  • Heat map & clustering visualization (example)

We must reorder objects (function reorder.hclust) so that their order in the dissimilarity matrix is respected . This does not affect the topology of the dendrogram.

RP17: On tikus data

  • Using Bray-Curtis dissimilarity, compute the four common clustering methods and compare them using cophenetic correlation and Shepard-like diagram

  • Choose method with the highest cophenetic correlation and produce a heat map of the distance matrix reordered with the dendrogram

Non-Hierarchical Clustering

(NOTE ON FUZZY)

  • Create partition, without hierarchy.

  • Determine a partition of the objects into k groups, or clusters, such as the objects within each cluster are more similar to one other than to objects in the other clusters.

  • It require an initial configuration (user usually determine the number of groups, k), which will be optimized in a recursive process (often random). If random, the initial configuration is run a large number of times with different initial configurations in order to find the best solution.

The most known and commonly used non-hierarchical partitioning algorithms is k-means clustering (MacQueen, 1967), in which, each cluster is represented by the center or means of the data points belonging to the cluster.

Three crital steps:

  • Initialization (various Methods: Lloyd’s algorithm is the mots common): k observations from the dataset are used as the initial means. The random partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster’s randomly assigned points.

  • Assignment step Assign each observation to the cluster with the nearest mean: that with the least squared Euclidean distance (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means)

  • Update step Recalculate means (centroids) for observations assigned to each cluster.

The algorithm has converged when the assignments no longer change. The algorithm is not always guaranteed to find the optimum.

The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a different distance function other than (squared) Euclidean distance may prevent the algorithm from converging. Various modifications of k-means such as k-medoids [PAM (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990] have been proposed to allow using other distance measures. In this case, cluster is represented by one of the objects in the cluster.

Note: if variable in the data table are not dimensionally homogenous, they must be standardized prior to partitioning

Let’s play card to simply illustrated what is going one: check here

Implementing k-means/PAM partitioning in R

The aims is to identify high-density regions in the data. To do so, the method iteratively minimizes an objective function the total error sum of squares (TESS or SSE), which is the sum of the within groups sums-of squares. It is basically the sum, over the k groups, of the sums of squared distance among the objects in the group, each divided by the number of objects in the group.

With a pre-determined number of groups, recommended function is: kmeans from the stats package. Argument nstart will repeat the analysis a large number of time using different initial configuration until finding the best solution.

Note: do not forget to normalized your data

## 18 15 24 27 23 19 22 16 28 13 14 20 25  7  5  6  3  4  2  9 12 10 11 21 
##  4  2  2  2  5  3  1  1  2  4  1  5  2  4  4  4  3  3  3  3  3  3  3  5
##    spe.ch.UPGMA.g
##     1 2 3 4 5
##   1 0 2 0 1 0
##   2 0 5 0 0 0
##   3 0 0 8 0 0
##   4 5 0 0 0 0
##   5 0 2 0 0 1

Fuzzy clustering

The fuzzy clustering is considered as soft clustering ors or soft k-means, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to the degree of being in a given cluster. In fuzzy clustering, points close to the center of a cluster, may be in the cluster to a higher degree than points in the edge of a cluster. The degree, to which an element belongs to a given cluster, is a numerical value varying from 0 to 1.

This is fundamnetally different from k-means and k-medoid clustering, where each object is affected exactly to one cluster. k-means and k-medoids clustering are known as hard or non-fuzzy clustering.

In other words, in non-fuzzy clustering an apple can be red or green (hard clustering). Here, the apple can be red AND green (soft clustering). The apple will be red to a certain degree [red = 0.5] as well as green to a certain degree [green = 0.5].

The fuzzy c-means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. The centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster. The function fanny [cluster package] can be used to compute fuzzy clustering. ‘FANNY’ stands for fuzzy analysis clustering.

## Fuzzy Clustering object of class 'fanny' :                      
## m.ship.expon.        2
## objective     3.792824
## tolerance        1e-15
## iterations         194
## converged            1
## maxit              500
## n                   24
## Membership coefficients (in %, rounded):
##    [,1] [,2] [,3]
## 18   45   33   21
## 15   35   49   16
## 24   35   44   20
## 27   34   48   18
## 23   39   44   18
## 19   33   34   33
## 22   36   43   21
## 16   38   42   20
## 28   34   45   21
## 13   41   32   27
## 14   39   38   24
## 20   40   43   17
## 25   34   47   18
## 7    45   32   23
## 5    42   32   26
## 6    44   31   25
## 3    20   17   63
## 4    31   24   45
## 2    15   13   72
## 9    18   16   66
## 12   15   13   72
## 10   16   14   70
## 11   33   28   39
## 21   36   34   30
## Fuzzyness coefficients:
## dunn_coeff normalized 
## 0.39226789 0.08840183 
## Closest hard clustering:
## 18 15 24 27 23 19 22 16 28 13 14 20 25  7  5  6  3  4  2  9 12 10 11 21 
##  1  2  2  2  2  2  2  2  2  1  1  2  2  1  1  1  3  3  3  3  3  3  3  1 
## 
## Available components:
##  [1] "membership"  "coeff"       "memb.exp"    "clustering"  "k.crisp"    
##  [6] "objective"   "convergence" "diss"        "call"        "silinfo"    
## [11] "data"
##   cluster size ave.sil.width
## 1       1    7          0.17
## 2       2   10          0.30
## 3       3    7          0.57

Another example using the function cmeans from the package e1071 and a dataset on the criminality in USArrests

Four fuzzy clusters along the Doubs River

Four fuzzy clusters along the Doubs River

RP18: Let’s go back to our flowers and the iris dataset. The data set contains variable sepal length, sepal width, petal length, and petal width of flowers of different species

  • Using k-means cascade and silhouette width indicator determine the optimal number of clusters?

  • Since, we know that that there are 3 species are involved, we ask the algorithm to group the data into 3 clusters (common sense). How many points are wrongly classified? Plot the results

Decision trees

Here for a basic understanding of decision trees:

Classification And Regression Trees (CART)

The function ctree builds decision trees (recursive partitioning algorithm). The first parameter is a formula, which defines a target variable and a list of independent variables.

## 
## Classification tree:
## tree(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + 
##     Petal.Width, data = iris)
## Variables actually used in tree construction:
## [1] "Petal.Length" "Petal.Width"  "Sepal.Length"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.1253 = 18.05 / 144 
## Misclassification error rate: 0.02667 = 4 / 150

Another fancier option with the package rpart:

One of the disadvantages of decision trees may be overfitting i.e. continually creating partitions to achieve a relatively homogeneous population. This problem can be alleviated by pruning the tree (CART), which is basically removing the decisions from the bottom up. Another way is to combine several trees and obtain a consensus, which can be done via a process called random forests (bootstrapped version of CART - many trees built based on subsets of the data, in addition not all predictor variables are used every time, rather a random subset).

Multivariate Regression Trees: constrained clustering

Multivariate regression trees (MRT; De’ath 2002) are an extension of univariate regression trees, a method allowing the recursive partitioning of a quantitative response variable under the control of a set of quantitative or categorical explanatory variables (Breiman et al. 1984). Such a procedure is sometimes called constrained or supervised clustering.The result is a tree whose “leaves” (terminal groups of sites) are composed of subsets of ‘sites’ chosen to minimize the within-group sums of squares (as in a k-means clustering), but where each successive partition is defined by a threshold value or a state of one of the explanatory variables.

The only package implementing a complete and handy version of MRT is mvpart. Unfortunately, this package is no longer supported by the R Core Team, so that no updates are available for R versions posterior to R 3.0.3. Nevertheless, mvpart can still be installed via github.

## X-Val rep : 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100
## Minimum tree sizes
## tabmins
##  2  3  4  8  9 10 11 
##  2  4  7 22 17 37 11

Hello World to Machine Learning

Create a Validation Dataset We will split iris dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.

You now have training data in the dataset variable and a validation set we will use later in the validation variable.

Evaluate Algorithm it is time to create some models of the data and estimate their accuracy on unseen data.

  • Step 1: set-up the test harness to use 10-fold cross validation. We will 10-fold crossvalidation to estimate accuracy.This will split our dataset into 10 parts, train in 9 and test on 1 and release for all combinations of train-test splits. We will also repeat the process 3 times for each algorithm with different splits of the data into 10 groups, in an effort to get a more accurate estimate

We are using the metric of “Accuracy” to evaluate models. This is a ratio of the number of correctly predicted instances in divided by the total number of instances in the dataset multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the metric variable when we run build and evaluate each model next.

  • Step 2: Build Models. We don’t know which algorithms would be good on this problem or what configurations to use, let’s evaluate 3 different algorithms:

    • Classification and Regression Trees (CART) - non linear algorithm
    • k-Nearest Neighbors (kNN) - non linear algorithm
    • Linear Discriminant Analysis (LDA) - linear algorithm
    • Random Forest (RF) - advanced algorithms
  • Step 3: Build Models. We don’t know which algorithms would be good on this problem or what configurations to use, let’s evaluate 3 different algorithms:

We now have 4 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

We can report on the accuracy of each model by first creating a list of the created models and using the summary function.

##           Min.   1st Qu.    Median      Mean 3rd Qu. Max. NA's
## lda  0.9166667 0.9375000 1.0000000 0.9750000       1    1    0
## cart 0.8333333 0.9166667 0.9166667 0.9416667       1    1    0
## knn  0.8333333 0.9166667 1.0000000 0.9583333       1    1    0
## rf   0.8333333 0.9166667 0.9166667 0.9416667       1    1    0

We can see that the most accurate model in this case was LDA:

## Linear Discriminant Analysis 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results:
## 
##   Accuracy  Kappa 
##   0.975     0.9625

This gives a nice summary of what was used to train the model and the mean and standard deviation (SD) accuracy achieved, specifically 97.5% accuracy +/- 4%

Prediction

The LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          0        10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8843, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : 4.857e-15  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

We can see that the accuracy is 100%. It was a small validation dataset (20%), but this result is within our expected margin of 97% +/-4% (Accuracy +/- SD of lda model) suggesting we may have an accurate and a reliably accurate model.

See more https://machinelearningmastery.com/machine-learning-in-r-step-by-step/

Basically, we learn about one neurone of a network. You are only one step toward Deep Learning.

Machine Learning vs. Deep Learning

Machine Learning vs. Deep Learning